Optimization Theory for ReLU Neural Networks Trained with Normalization Layers
The success of deep neural networks is in part due to the use of
normalization layers. Normalization layers like Batch Normalization, Layer
Normalization and Weight Normalization are ubiquitous in practice, as they
improve generalization performance and speed up training significantly.
Nonetheless, the vast majority of current deep learning theory and non-convex
optimization literature focuses on the un-normalized setting, where the
functions under consideration do not exhibit the properties of commonly
normalized neural networks. In this paper, we bridge this gap by giving the
first global convergence result for two-layer neural networks with ReLU
activations trained with a normalization layer, namely Weight Normalization.
Our analysis shows how the introduction of normalization layers changes the
optimization landscape and can enable faster convergence as compared with
un-normalized neural networks. Comment: To be presented at ICML 2020.
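For concreteness, Weight Normalization reparameterizes each weight vector by decoupling its magnitude from its direction; a minimal sketch of the resulting two-layer ReLU network follows (the notation is ours, not necessarily the paper's):

```latex
% Two-layer ReLU network under Weight Normalization (illustrative notation)
f(x) \;=\; \sum_{k=1}^{m} c_k \,\sigma\!\left( g_k \,\frac{v_k^\top x}{\lVert v_k \rVert} \right),
\qquad \sigma(z) \;=\; \max(z, 0)
```

Each effective weight $w_k = g_k v_k / \lVert v_k \rVert$ thus has its magnitude $g_k$ and direction $v_k / \lVert v_k \rVert$ trained as separate parameters, which is what changes the optimization landscape relative to the un-normalized parameterization.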
Learning Expressive Prompting With Residuals for Vision Transformers
Prompt learning is an efficient approach to adapting transformers by inserting a learnable set of parameters into the input and intermediate representations of
a pre-trained model. In this work, we present Expressive Prompts with Residuals
(EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT). Our method constructs downstream representations via learnable ``output'' tokens, which are akin to the learned class tokens of the ViT. Further, for better steering of the downstream
representation processed by the frozen transformer, we introduce residual
learnable tokens that are added to the output of various computations. We apply
EXPRES for image classification, few-shot learning, and semantic segmentation, and show our method is capable of achieving state-of-the-art prompt tuning on
3/3 categories of the VTAB benchmark. In addition to strong performance, we
observe that our approach is an order of magnitude more prompt efficient than
existing visual prompting baselines. We analytically show the computational
benefits of our approach over weight space adaptation techniques like
finetuning. Lastly we systematically corroborate the architectural design of
our method via a series of ablation experiments. Comment: Accepted at CVPR (2023).
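To make the mechanism concrete, here is a minimal PyTorch-style sketch of prompt tuning with learnable output tokens and per-block residual tokens; module names and shapes are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class ResidualPromptedViT(nn.Module):
    """Illustrative sketch: learnable ``output'' tokens prepended to the sequence and
    learnable residual tokens added after each frozen ViT block (names/shapes assumed)."""

    def __init__(self, frozen_blocks: nn.ModuleList, dim: int, num_output_tokens: int = 4):
        super().__init__()
        self.blocks = frozen_blocks
        for p in self.blocks.parameters():               # backbone stays frozen
            p.requires_grad = False
        # learnable "output" tokens, analogous to the ViT class token
        self.output_tokens = nn.Parameter(torch.zeros(1, num_output_tokens, dim))
        # one learnable residual token per block, added to that block's output
        self.residual_tokens = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, 1, dim)) for _ in range(len(self.blocks))]
        )

    def forward(self, patch_embeddings):                 # (batch, patches, dim)
        b = patch_embeddings.shape[0]
        x = torch.cat([self.output_tokens.expand(b, -1, -1), patch_embeddings], dim=1)
        for block, res in zip(self.blocks, self.residual_tokens):
            x = block(x) + res                           # residual token steers the frozen computation
        return x[:, : self.output_tokens.shape[1]]       # downstream representation from output tokens
```

Only the output and residual tokens receive gradients, which is what keeps the per-task parameter count an order of magnitude below standard visual prompting.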
A theory for undercompressive shocks in tears of wine
We revisit the tears of wine problem for thin films in water-ethanol mixtures
and present a new model for the climbing dynamics. The new formulation includes
a Marangoni stress balanced by both the normal and tangential components of
gravity as well as surface tension, which leads to distinctly different behavior.
The prior literature did not address the wine tears but rather the behavior of
the film at earlier stages and the behavior of the meniscus. In the lubrication
limit we obtain an equation that is already well-known for rising films in the
presence of thermal gradients. Such models can exhibit non-classical shocks
that are undercompressive. We present basic theory that allows one to identify
the signature of an undercompressive (UC) wave. We observe both compressive and
undercompressive waves in new experiments and we argue that, in the case of a
pre-coated glass, the famous "wine tears" emerge from a reverse
undercompressive shock originating at the meniscus.
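For context, the lubrication-limit equation referred to above is of the type long studied for thermally driven climbing films; a standard non-dimensional form is sketched below (the coefficients and scalings are illustrative and may differ from the paper's):

```latex
% Film-height evolution with a non-convex flux and fourth-order surface-tension regularization
h_t + \left( h^2 - h^3 \right)_x \;=\; -\left( h^3\, h_{xxx} \right)_x
```

The non-convex flux $h^2 - h^3$ (Marangoni driving competing with gravitational drainage), regularized by the fourth-order surface-tension term, is what admits undercompressive shocks: fronts that violate the classical Lax entropy condition, with characteristics passing through the wave rather than converging on it.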
Wasserstein Diffusion Tikhonov Regularization
We propose regularization strategies for learning discriminative models that
are robust to in-class variations of the input data. We use the Wasserstein-2
geometry to capture semantically meaningful neighborhoods in the space of
images, and define a corresponding input-dependent additive noise data
augmentation model. Expanding and integrating the augmented loss yields an
effective Tikhonov-type Wasserstein diffusion smoothness regularizer. This
approach allows us to apply high levels of regularization and train functions
that have low variability within classes but remain flexible across classes. We
provide efficient methods for computing the regularizer at a negligible cost in
comparison to training with adversarial data augmentation. Initial experiments
demonstrate improvements in generalization performance under adversarial perturbations as well as under large in-class variations of the input data.
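As a rough illustration of a Tikhonov-type smoothness penalty of the kind described, the sketch below penalizes the input gradient of the loss under an input-dependent weighting; the `noise_scale` callable is a generic stand-in for the Wasserstein-2-derived weighting in the paper, not its actual form.

```python
import torch

def tikhonov_regularized_loss(model, loss_fn, x, y, noise_scale, lam=1.0):
    """Illustrative Tikhonov-type regularizer: squared norm of the input gradient
    of the loss, weighted by an input-dependent scale (`noise_scale` is an
    assumption standing in for the paper's Wasserstein-2 weighting)."""
    x = x.detach().clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    # gradient of the loss with respect to the inputs, kept in the graph
    grad_x, = torch.autograd.grad(loss, x, create_graph=True)
    weighted = noise_scale(x) * grad_x
    penalty = weighted.pow(2).flatten(start_dim=1).sum(dim=1).mean()
    return loss + lam * penalty
```

Because the penalty is computed from a single extra gradient rather than from adversarially optimized samples, its cost stays small compared with adversarial data augmentation.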
SAFE: Machine Unlearning With Shard Graphs
We present Synergy Aware Forgetting Ensemble (SAFE), a method to adapt large
models on a diverse collection of data while minimizing the expected cost to
remove the influence of training samples from the trained model. This process,
also known as selective forgetting or unlearning, is often conducted by
partitioning a dataset into shards, training fully independent models on each,
then ensembling the resulting models. Increasing the number of shards reduces the expected cost to forget, but at the same time increases inference cost
and reduces the final accuracy of the model since synergistic information
between samples is lost during the independent model training. Rather than
treating each shard as independent, SAFE introduces the notion of a shard
graph, which allows incorporating limited information from other shards during
training, trading off a modest increase in expected forgetting cost with a
significant increase in accuracy, all while still attaining complete removal of
residual influence after forgetting. SAFE uses a lightweight system of adapters
which can be trained while reusing most of the computations. This allows SAFE
to be trained on shards an order-of-magnitude smaller than current
state-of-the-art methods (thus reducing the forgetting costs) while also
maintaining high accuracy, as we demonstrate empirically on fine-grained
computer vision datasets.
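A schematic of the shard-graph idea in Python follows; the shard containers and the `train_adapter` routine are placeholders for this sketch, not the authors' implementation.

```python
# Schematic of shard-graph training and forgetting in the spirit of SAFE.

def train_all(shards, shard_graph, train_adapter):
    """shard_graph[i] lists the other shards whose data adapter i is allowed to see."""
    adapters = {}
    for i, shard in enumerate(shards):
        neighbors = [shards[j] for j in shard_graph.get(i, [])]
        adapters[i] = train_adapter(shard, neighbors)   # adapter i depends only on shard i and its in-edges
    return adapters

def forget(sample_shard, shards, shard_graph, adapters, train_adapter):
    """After deleting the sample from shards[sample_shard], retrain exactly the adapters
    whose training data included that shard, removing all residual influence."""
    affected = [i for i in range(len(shards))
                if i == sample_shard or sample_shard in shard_graph.get(i, [])]
    for i in affected:
        neighbors = [shards[j] for j in shard_graph.get(i, [])]
        adapters[i] = train_adapter(shards[i], neighbors)
    return adapters
```

A sparser graph means fewer adapters to retrain per deletion (lower forgetting cost); denser edges recover more cross-shard synergy (higher accuracy), which is the trade-off the abstract describes.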
Your representations are in the network: composable and parallel adaptation for large scale models
We propose InCA, a lightweight method for transfer learning that
cross-attends to any activation layer of a pre-trained model. During training,
InCA uses a single forward pass to extract multiple activations, which are
passed to external cross-attention adapters, trained anew and combined or
selected for downstream tasks. We show that, even when selecting a single
top-scoring adapter, InCA achieves performance comparable to full fine-tuning,
at a cost comparable to fine-tuning just the last layer. For example, with a
cross-attention probe 1.3% the size of a pre-trained ViT-L/16 model, we achieve
performance within 0.2% of the full fine-tuning paragon at a computational
training cost of 51% of the baseline, on average across 11 downstream classification tasks. Unlike other forms of efficient adaptation, InCA does not
require backpropagating through the pre-trained model, thus leaving its
execution unaltered at both training and inference. The versatility of InCA is
best illustrated in fine-grained tasks, which may require accessing information
absent in the last layer but accessible in intermediate layer activations.
Since the backbone is fixed, InCA allows parallel ensembling as well as
parallel execution of multiple tasks. InCA achieves state-of-the-art
performance in the ImageNet-to-Sketch multi-task benchmark. Comment: Accepted to NeurIPS 2023.
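Below is a minimal sketch of an external cross-attention adapter over frozen intermediate activations, in the spirit of the description above; the module names, sizes, and pooling scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Illustrative external adapter: a learned query cross-attends to frozen
    intermediate activations of a backbone (names and sizes are assumptions)."""

    def __init__(self, act_dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, act_dim))
        self.attn = nn.MultiheadAttention(act_dim, num_heads, batch_first=True)
        self.head = nn.Linear(act_dim, num_classes)

    def forward(self, activations):                       # (batch, tokens, act_dim) from one layer
        q = self.query.expand(activations.shape[0], -1, -1)
        pooled, _ = self.attn(q, activations, activations) # cross-attend: learned query vs. activation tokens
        return self.head(pooled.squeeze(1))

if __name__ == "__main__":
    # toy demo with random tensors standing in for one layer's frozen activations
    acts = torch.randn(4, 197, 1024)                       # ViT-L/16-like token grid (assumed shape)
    adapter = CrossAttentionAdapter(act_dim=1024, num_classes=10)
    print(adapter(acts).shape)                             # torch.Size([4, 10])
```

Since gradients flow only through the adapter and never through the backbone, one frozen forward pass can feed many such adapters (one per layer or per task) trained and executed in parallel.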
Part I: The geometry and manipulation of natural data for optimizing neural networks Part II: A theory for undercompressive shocks in tears of wine
In Part I of the thesis, we present a body of work analyzing and deriving data-centric regularization methods for the effective training of machine learning models. Machine learning, and deep learning in particular, has been highly successful in computer vision and generative modelling in recent years. Nonetheless, the progress of such approaches crucially relies on effective regularization, architectural, and algorithmic choices that are often abstracted away during a first consideration. In this part we present the reader with effective regularization approaches focused on the geometry and biases of natural data and the parameterization of deep neural networks. We start by deriving a regularization that accurately captures geometric robustness and natural variances of images in Chapter 1. This approach enables significant improvements in model robustness and relies on the theory of optimal transport, which we introduce alongside our method in the chapter. In Chapter 2, dataset regularization is extended to active manipulation of the sampling distribution, as opposed to each datum; there we present a general and differentiable technique for dataset optimization that enables de-biasing of noisy and imbalanced datasets. In our final contribution to Part I, Chapter 3, we study the interplay between data and model parameterization. This concerns the widespread architectural practice of neural network normalization. We analyze the convergence dynamics of Weight Normalization and present the first proof of global convergence for dynamically normalized ReLU networks trained with gradient descent.

In Part II, we study the fluid dynamics phenomenon known as the tears of wine problem for thin films in water-ethanol mixtures and present a model for the climbing dynamics. The new formulation includes a Marangoni stress balanced by both the normal and tangential components of gravity as well as surface tension, which leads to distinctly different behavior. The prior literature did not address the wine tears but rather the behavior of the film at earlier stages and the behavior of the meniscus. In the lubrication limit we obtain an equation that is already well known for rising films in the presence of thermal gradients. Such models can exhibit nonclassical shocks that are undercompressive. We present basic theory that allows one to identify the signature of an undercompressive wave. We observe both compressive and undercompressive waves in new experiments and argue that, in the case of a pre-swirled glass, the famous "wine tears" emerge from a reverse undercompressive shock originating at the meniscus.